Whats Covered
- Mutating joins
- Filtering joins and set operations
- Assembling data
- Advanced joining
- Case Study
Aditional Resources
Assembling data
Binds
- Base R binds
rbind,cbind
- dplyr binds
bind_rows,bind_cols
- Advantages of dplyr versions
- faster
- return a tibble (advanced data frame)
- can handle lists of dataframes
.idargument inbind_rowsallows you to pass in a name for each source dataframe- this will be added as a column to the result dataset
- this is really handy when one of the variables for the data set is stored in the table name.
- e.g. houly weather data for different days
rbindwill return an error if the column names do not match exactly.bind_rowswill create a column for each unique column and distributeNAs appropriately
Which bind?
side_one and side_two contain tracks from sides one and two, respectively, of Pink Floyd’s famous album The Dark Side of the Moon.
Bind these datasets into a single table using a dplyr function. Which type of bind should you use?
# Examine side_one and side_two
side_one %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| song | length |
|---|---|
| Speak to Me | 01:30:00 |
| Breathe | 02:43:00 |
| On the Run | 03:30:00 |
| Time | 06:53:00 |
| The Great Gig in the Sky | 04:15:00 |
side_two %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| song | length |
|---|---|
| Money | 06:30:00 |
| Us and Them | 07:51:00 |
| Any Colour You Like | 03:24:00 |
| Brain Damage | 03:50:00 |
| Eclipse | 02:03:00 |
# Bind side_one and side_two into a single dataset
side_one %>%
bind_rows(side_two) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| song | length |
|---|---|
| Speak to Me | 01:30:00 |
| Breathe | 02:43:00 |
| On the Run | 03:30:00 |
| Time | 06:53:00 |
| The Great Gig in the Sky | 04:15:00 |
| Money | 06:30:00 |
| Us and Them | 07:51:00 |
| Any Colour You Like | 03:24:00 |
| Brain Damage | 03:50:00 |
| Eclipse | 02:03:00 |
Bind rows
discography and jimi contain all of the information you need to create an anthology dataset for the band The Jimi Hendrix Experience.
discography contains a data frame of each album by The Jimi Hendrix Experience and the year of the album.
jimi contains a list of data frames of album tracks, one for each album released by The Jimi Hendrix Experience. As explained, you can pass bind_rows() a list of data frames like jimi to bind together into a single data frame.
# Examine discography and jimi
discography## # A tibble: 3 x 2
## album year
## <chr> <int>
## 1 Are You Experienced 1967
## 2 Axis: Bold as Love 1967
## 3 Electric Ladyland 1968
jimi## $`Are You Experienced`
## # A tibble: 10 x 2
## song length
## <chr> <time>
## 1 Purple Haze 02:46
## 2 Manic Depression 03:46
## 3 Hey Joe 03:23
## 4 May This Be Love 03:14
## 5 I Don't Live Today 03:55
## 6 The Wind Cries Mary 03:21
## 7 Fire 02:34
## 8 Third Stone from the Sun 06:40
## 9 Foxy Lady 03:15
## 10 Are You Experienced? 03:55
##
## $`Axis: Bold As Love`
## # A tibble: 13 x 2
## song length
## <chr> <time>
## 1 EXP 01:55
## 2 Up from the Skies 02:55
## 3 Spanish Castle Magic 03:00
## 4 Wait Until Tomorrow 03:00
## 5 Ain't No Telling 01:46
## 6 Little Wing 02:24
## 7 If 6 was 9 05:32
## 8 You Got Me Floatin 02:45
## 9 Castles Made of Sand 02:46
## 10 She's So Fine 02:37
## 11 One Rainy Wish 03:40
## 12 Little Miss Lover 02:20
## 13 Bold as Love 04:11
##
## $`Electric Ladyland`
## # A tibble: 16 x 2
## song length
## <chr> <time>
## 1 And the Gods Made Love 01:21
## 2 Have You Ever Been (To Electric Ladyland) 02:11
## 3 Crosstown Traffic 02:25
## 4 Voodoo Chile 15:00
## 5 Little Miss Strange 02:52
## 6 Long Hot Summer Night 03:27
## 7 Come On (Part 1) 04:09
## 8 Gypsy Eyes 03:43
## 9 Burning of the Midnight Lamp 03:39
## 10 Rainy Day, Dream Away 03:42
## 11 1983... (A Merman I Should Turn to Be) 13:39
## 12 Moon, Turn the Tides... Gently Gently Away 01:02
## 13 Still Raining, Still Dreaming 04:25
## 14 House Burning Down 04:33
## 15 All Along the Watchtower 04:01
## 16 Voodoo Child (Slight Return) 05:12
jimi %>%
# Bind jimi by rows, with ID column "album"
bind_rows(.id = c("album")) %>%
# Make a complete data frame
left_join(discography, by = c("album")) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| album | song | length | year |
|---|---|---|---|
| Are You Experienced | Purple Haze | 02:46:00 | 1967 |
| Are You Experienced | Manic Depression | 03:46:00 | 1967 |
| Are You Experienced | Hey Joe | 03:23:00 | 1967 |
| Are You Experienced | May This Be Love | 03:14:00 | 1967 |
| Are You Experienced | I Don’t Live Today | 03:55:00 | 1967 |
| Are You Experienced | The Wind Cries Mary | 03:21:00 | 1967 |
| Are You Experienced | Fire | 02:34:00 | 1967 |
| Are You Experienced | Third Stone from the Sun | 06:40:00 | 1967 |
| Are You Experienced | Foxy Lady | 03:15:00 | 1967 |
| Are You Experienced | Are You Experienced? | 03:55:00 | 1967 |
| Axis: Bold As Love | EXP | 01:55:00 | NA |
| Axis: Bold As Love | Up from the Skies | 02:55:00 | NA |
| Axis: Bold As Love | Spanish Castle Magic | 03:00:00 | NA |
| Axis: Bold As Love | Wait Until Tomorrow | 03:00:00 | NA |
| Axis: Bold As Love | Ain’t No Telling | 01:46:00 | NA |
| Axis: Bold As Love | Little Wing | 02:24:00 | NA |
| Axis: Bold As Love | If 6 was 9 | 05:32:00 | NA |
| Axis: Bold As Love | You Got Me Floatin | 02:45:00 | NA |
| Axis: Bold As Love | Castles Made of Sand | 02:46:00 | NA |
| Axis: Bold As Love | She’s So Fine | 02:37:00 | NA |
| Axis: Bold As Love | One Rainy Wish | 03:40:00 | NA |
| Axis: Bold As Love | Little Miss Lover | 02:20:00 | NA |
| Axis: Bold As Love | Bold as Love | 04:11:00 | NA |
| Electric Ladyland | And the Gods Made Love | 01:21:00 | 1968 |
| Electric Ladyland | Have You Ever Been (To Electric Ladyland) | 02:11:00 | 1968 |
| Electric Ladyland | Crosstown Traffic | 02:25:00 | 1968 |
| Electric Ladyland | Voodoo Chile | 15:00:00 | 1968 |
| Electric Ladyland | Little Miss Strange | 02:52:00 | 1968 |
| Electric Ladyland | Long Hot Summer Night | 03:27:00 | 1968 |
| Electric Ladyland | Come On (Part 1) | 04:09:00 | 1968 |
| Electric Ladyland | Gypsy Eyes | 03:43:00 | 1968 |
| Electric Ladyland | Burning of the Midnight Lamp | 03:39:00 | 1968 |
| Electric Ladyland | Rainy Day, Dream Away | 03:42:00 | 1968 |
| Electric Ladyland | 1983… (A Merman I Should Turn to Be) | 13:39:00 | 1968 |
| Electric Ladyland | Moon, Turn the Tides… Gently Gently Away | 01:02:00 | 1968 |
| Electric Ladyland | Still Raining, Still Dreaming | 04:25:00 | 1968 |
| Electric Ladyland | House Burning Down | 04:33:00 | 1968 |
| Electric Ladyland | All Along the Watchtower | 04:01:00 | 1968 |
| Electric Ladyland | Voodoo Child (Slight Return) | 05:12:00 | 1968 |
Bind columns
Let’s make a compilation of Hank Williams’ 67 singles. To do this, you can use hank_years and hank_charts:
hank_yearscontains the name and release year of each of Hank Williams’ 67 singles.
hank_chartscontains the name of each of Hank Williams’ 67 singles as well as the highest position it earned on the Billboard sales charts.
Each dataset contains the same songs, but hank_years is arranged chronologically by year, while hank_charts is arranged alphabetically by song title.
# Examine hank_years and hank_charts
hank_years %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | song |
|---|---|
| 1947 | Move It On Over |
| 1947 | My Love for You (Has Turned to Hate) |
| 1947 | Never Again (Will I Knock on Your Door) |
| 1947 | On the Banks of the Old Ponchartrain |
| 1947 | Pan American |
| 1947 | Wealth Won’t Save Your Soul |
| 1948 | A Mansion on the Hill |
| 1948 | Honky Tonkin’ |
| 1948 | I Saw the Light |
| 1948 | I’m a Long Gone Daddy |
| 1948 | My Sweet Love Ain’t Around |
| 1949 | I’m So Lonesome I Could Cry |
| 1949 | Lost Highway |
| 1949 | Lovesick Blues |
| 1949 | Mind Your Own Business |
| 1949 | My Bucket’s Got a Hole in It |
| 1949 | Never Again (Will I Knock on Your Door) |
| 1949 | Wedding Bells |
| 1949 | You’re Gonna Change (Or I’m Gonna Leave) |
| 1950 | I Just Don’t Like This Kind of Living |
| 1950 | Long Gone Lonesome Blues |
| 1950 | Moanin’ the Blues |
| 1950 | My Son Calls Another Man Daddy |
| 1950 | Nobody’s Lonesome for Me |
| 1950 | They’ll Never Take Her Love from Me |
| 1950 | Why Don’t You Love Me |
| 1950 | Why Should We Try Anymore |
| 1951 | (I Heard That) Lonesome Whistle |
| 1951 | Baby, We’re Really in Love |
| 1951 | Cold, Cold Heart |
| 1951 | Crazy Heart |
| 1951 | Dear John |
| 1951 | Hey Good Lookin’ |
| 1951 | Howlin’ At the Moon |
| 1951 | I Can’t Help It (If I’m Still in Love With You) |
| 1952 | Half as Much |
| 1952 | Honky Tonk Blues |
| 1952 | I’ll Never Get Out of This World Alive |
| 1952 | Jambalaya (On the Bayou) |
| 1952 | Settin’ the Woods on Fire |
| 1952 | You Win Again |
| 1953 | Calling You |
| 1953 | I Won’t Be Home No More |
| 1953 | Kaw-Liga |
| 1953 | Take These Chains from My Heart |
| 1953 | Weary Blues from Waitin’ |
| 1953 | Your Cheatin’ Heart |
| 1954 | (I’m Gonna) Sing, Sing, Sing |
| 1954 | How Can You Refuse Him Now |
| 1954 | I’m Satisfied with You |
| 1954 | You Better Keep It on Your Mind |
| 1955 | A Teardrop on a Rose |
| 1955 | At the First Fall of Snow |
| 1955 | Mother Is Gone |
| 1955 | Please Don’t Let Me Love You |
| 1955 | Thank God |
| 1956 | A Home in Heaven |
| 1956 | California Zephyr |
| 1956 | Singing Waterfall |
| 1956 | There’s No Room in My Heart for the Blues |
| 1957 | Leave Me Alone with the Blues |
| 1957 | Ready to Go Home |
| 1957 | The Waltz of the Wind |
| 1958 | Just Waitin’ |
| 1965 | The Pale Horse and His Rider |
| 1966 | Kaw-Liga |
| 1989 | There’s a Tear in My Beer |
hank_charts %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| song | peak |
|---|---|
| (I Heard That) Lonesome Whistle | 9 |
| (I’m Gonna) Sing, Sing, Sing | NA |
| A Home in Heaven | NA |
| A Mansion on the Hill | 12 |
| A Teardrop on a Rose | NA |
| At the First Fall of Snow | NA |
| Baby, We’re Really in Love | 4 |
| California Zephyr | NA |
| Calling You | NA |
| Cold, Cold Heart | 1 |
| Crazy Heart | 4 |
| Dear John | 8 |
| Half as Much | 2 |
| Hey Good Lookin’ | 1 |
| Honky Tonk Blues | 2 |
| Honky Tonkin’ | 14 |
| How Can You Refuse Him Now | NA |
| Howlin’ At the Moon | 3 |
| I Can’t Help It (If I’m Still in Love With You) | 2 |
| I Just Don’t Like This Kind of Living | 5 |
| I Saw the Light | NA |
| I Won’t Be Home No More | 4 |
| I’ll Never Get Out of This World Alive | 1 |
| I’m a Long Gone Daddy | 6 |
| I’m Satisfied with You | NA |
| I’m So Lonesome I Could Cry | 2 |
| Jambalaya (On the Bayou) | 1 |
| Just Waitin’ | NA |
| Kaw-Liga | 1 |
| Kaw-Liga | NA |
| Leave Me Alone with the Blues | NA |
| Long Gone Lonesome Blues | 1 |
| Lost Highway | 12 |
| Lovesick Blues | 1 |
| Mind Your Own Business | 5 |
| Moanin’ the Blues | 1 |
| Mother Is Gone | NA |
| Move It On Over | 4 |
| My Bucket’s Got a Hole in It | 2 |
| My Love for You (Has Turned to Hate) | NA |
| My Son Calls Another Man Daddy | 9 |
| My Sweet Love Ain’t Around | NA |
| Never Again (Will I Knock on Your Door) | NA |
| Never Again (Will I Knock on Your Door) | 6 |
| Nobody’s Lonesome for Me | 9 |
| On the Banks of the Old Ponchartrain | NA |
| Pan American | NA |
| Please Don’t Let Me Love You | 9 |
| Ready to Go Home | NA |
| Settin’ the Woods on Fire | 2 |
| Singing Waterfall | NA |
| Take These Chains from My Heart | 1 |
| Thank God | NA |
| The Pale Horse and His Rider | NA |
| The Waltz of the Wind | NA |
| There’s a Tear in My Beer | 7 |
| There’s No Room in My Heart for the Blues | NA |
| They’ll Never Take Her Love from Me | 5 |
| Wealth Won’t Save Your Soul | NA |
| Weary Blues from Waitin’ | 7 |
| Wedding Bells | 2 |
| Why Don’t You Love Me | 1 |
| Why Should We Try Anymore | 9 |
| You Better Keep It on Your Mind | NA |
| You Win Again | 10 |
| You’re Gonna Change (Or I’m Gonna Leave) | 4 |
| Your Cheatin’ Heart | 1 |
hank_years %>%
# Reorder hank_years alphabetically by song title
arrange(song) %>%
# Select just the year column
select(year) %>%
# Bind the year column
bind_cols(hank_charts) %>%
# Arrange the finished dataset
arrange(year, song) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | song | peak |
|---|---|---|
| 1947 | Move It On Over | 4 |
| 1947 | My Love for You (Has Turned to Hate) | NA |
| 1947 | Never Again (Will I Knock on Your Door) | NA |
| 1947 | On the Banks of the Old Ponchartrain | NA |
| 1947 | Pan American | NA |
| 1947 | Wealth Won’t Save Your Soul | NA |
| 1948 | A Mansion on the Hill | 12 |
| 1948 | Honky Tonkin’ | 14 |
| 1948 | I’m Satisfied with You | NA |
| 1948 | I Just Don’t Like This Kind of Living | 5 |
| 1948 | My Sweet Love Ain’t Around | NA |
| 1949 | I Won’t Be Home No More | 4 |
| 1949 | Lost Highway | 12 |
| 1949 | Lovesick Blues | 1 |
| 1949 | Mind Your Own Business | 5 |
| 1949 | My Bucket’s Got a Hole in It | 2 |
| 1949 | Never Again (Will I Knock on Your Door) | 6 |
| 1949 | Wedding Bells | 2 |
| 1949 | You Better Keep It on Your Mind | NA |
| 1950 | I’m a Long Gone Daddy | 6 |
| 1950 | Long Gone Lonesome Blues | 1 |
| 1950 | Moanin’ the Blues | 1 |
| 1950 | My Son Calls Another Man Daddy | 9 |
| 1950 | Nobody’s Lonesome for Me | 9 |
| 1950 | They’ll Never Take Her Love from Me | 5 |
| 1950 | Why Don’t You Love Me | 1 |
| 1950 | Why Should We Try Anymore | 9 |
| 1951 | (I’m Gonna) Sing, Sing, Sing | NA |
| 1951 | Baby, We’re Really in Love | 4 |
| 1951 | Cold, Cold Heart | 1 |
| 1951 | Crazy Heart | 4 |
| 1951 | Dear John | 8 |
| 1951 | Hey Good Lookin’ | 1 |
| 1951 | Howlin’ At the Moon | 3 |
| 1951 | I’ll Never Get Out of This World Alive | 1 |
| 1952 | Half as Much | 2 |
| 1952 | Honky Tonk Blues | 2 |
| 1952 | I Can’t Help It (If I’m Still in Love With You) | 2 |
| 1952 | Jambalaya (On the Bayou) | 1 |
| 1952 | Settin’ the Woods on Fire | 2 |
| 1952 | You’re Gonna Change (Or I’m Gonna Leave) | 4 |
| 1953 | Calling You | NA |
| 1953 | I’m So Lonesome I Could Cry | 2 |
| 1953 | Kaw-Liga | 1 |
| 1953 | Take These Chains from My Heart | 1 |
| 1953 | Weary Blues from Waitin’ | 7 |
| 1953 | Your Cheatin’ Heart | 1 |
| 1954 | (I Heard That) Lonesome Whistle | 9 |
| 1954 | How Can You Refuse Him Now | NA |
| 1954 | I Saw the Light | NA |
| 1954 | You Win Again | 10 |
| 1955 | A Teardrop on a Rose | NA |
| 1955 | At the First Fall of Snow | NA |
| 1955 | Mother Is Gone | NA |
| 1955 | Please Don’t Let Me Love You | 9 |
| 1955 | Thank God | NA |
| 1956 | A Home in Heaven | NA |
| 1956 | California Zephyr | NA |
| 1956 | Singing Waterfall | NA |
| 1956 | There’s No Room in My Heart for the Blues | NA |
| 1957 | Leave Me Alone with the Blues | NA |
| 1957 | Ready to Go Home | NA |
| 1957 | The Waltz of the Wind | NA |
| 1958 | Just Waitin’ | NA |
| 1965 | The Pale Horse and His Rider | NA |
| 1966 | Kaw-Liga | NA |
| 1989 | There’s a Tear in My Beer | 7 |
Danger
- The biggest risk when using
bind_colsis that the observations may not align acros the rows of your data frame
- Indeed, you will create meaningless data if you column bind two datasets whose rows do not align. Unfortunately, there is usually no clear way to tell whether or not the rows in two datasets align unless the datasets contain a mutual key. In that scenario, you can use a mutating join to bind the datasets in a foolproof way.
Build a better data frame
- R Base way to create a dataframe
data.frame()andas.data.frame()
- dplyr way to create a dataframe
data_frame()andas_data_frame()
- Advantages of using
data_frame():- will never change the data type of a vector. (e.g. strings to factors)
- will never add row names
- will not changes unusual column names
- only recycles length 1 inputs
- evaluate lazily and in order. (So you can reference a column in the next column definition)
- outputs a tibble (class tbl_df)
as_data_framedoes all these same things- its just used on a list of columns
Make a data frame
Let’s make a Greatest Hits compilation for Hank Williams. hank_year, hank_song, and hank_peak contain the columns of the data frame you made in the last exercise.
# Make combined data frame using data_frame()
data_frame(year = hank_year, song = hank_song, peak = hank_peak) %>%
# Extract songs where peak equals 1
filter(peak == 1) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | song | peak |
|---|---|---|
| 1949 | Lovesick Blues | 1 |
| 1950 | Long Gone Lonesome Blues | 1 |
| 1950 | Moanin’ the Blues | 1 |
| 1950 | Why Don’t You Love Me | 1 |
| 1951 | Cold, Cold Heart | 1 |
| 1951 | Hey Good Lookin’ | 1 |
| 1952 | I’ll Never Get Out of This World Alive | 1 |
| 1952 | Jambalaya (On the Bayou) | 1 |
| 1953 | Kaw-Liga | 1 |
| 1953 | Take These Chains from My Heart | 1 |
| 1953 | Your Cheatin’ Heart | 1 |
Lists of columns
As a data scientist, you should always be prepared to handle raw data that comes in many different formats.
hank saves Hank Williams’ singles in a different way, as a list of vectors. Can you turn hank into the same dataset that you made in the last exercise?
# Examine the contents of hank
str(hank)## List of 3
## $ year: int [1:67] 1947 1947 1947 1947 1947 1947 1948 1948 1948 1948 ...
## $ song: chr [1:67] "Move It On Over" "My Love for You (Has Turned to Hate)" "Never Again (Will I Knock on Your Door)" "On the Banks of the Old Ponchartrain" ...
## $ peak: int [1:67] 4 NA NA NA NA NA 12 14 NA 2 ...
# Convert the hank list into a data frame
as_data_frame(hank) %>%
# Extract songs where peak equals 1
filter(peak == 1) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | song | peak |
|---|---|---|
| 1949 | Lovesick Blues | 1 |
| 1950 | Long Gone Lonesome Blues | 1 |
| 1950 | Moanin’ the Blues | 1 |
| 1950 | Why Don’t You Love Me | 1 |
| 1951 | Cold, Cold Heart | 1 |
| 1951 | Hey Good Lookin’ | 1 |
| 1952 | I’ll Never Get Out of This World Alive | 1 |
| 1952 | Jambalaya (On the Bayou) | 1 |
| 1953 | Kaw-Liga | 1 |
| 1953 | Take These Chains from My Heart | 1 |
| 1953 | Your Cheatin’ Heart | 1 |
Lists of rows (data frames)
michael contains a list of data frames, one for each album released by Michael Jackson. The code in the editor attempts to bind the data frames into a single data frame and then extract a data frame of the top tracks on each album.
However, the code runs into a problem. The commented line fails because as_data_frame() combines a list of column vectors into a data frame. However, michael is a list of data frames.
# Examine the contents of michael
michael## $`Got to Be There`
## # A tibble: 10 x 2
## song peak
## <chr> <int>
## 1 Ain'T No Sunshine NA
## 2 I Wanna be Where You Are NA
## 3 Girl Don't Take Your Love from Me NA
## 4 In Our Small Way NA
## 5 Got to Be There 4
## 6 Rockin' Robin 2
## 7 Wings of My Love NA
## 8 Maria (You Were the Only One) NA
## 9 Love is Here and Now You're Gone NA
## 10 You've Got a Friend NA
##
## $Ben
## # A tibble: 10 x 2
## song peak
## <chr> <int>
## 1 Ben 1
## 2 Greatest Show On Earth NA
## 3 People Make the World Go 'Round NA
## 4 We've Got a Good Thing NA
## 5 Everybody's Somebody's Fool NA
## 6 My Girl NA
## 7 What Goes Around Comes Around NA
## 8 In Our Small Way NA
## 9 Shoo-Be-Doo-Be-Doo-Da-Day NA
## 10 You Can Cry On My Shoulder NA
##
## $`Music & Me`
## # A tibble: 10 x 2
## song peak
## <chr> <int>
## 1 With a Child's Heart 50
## 2 Up Again NA
## 3 All the Things You Are NA
## 4 Happy NA
## 5 Too Young NA
## 6 Doggin' Around NA
## 7 Euphoria NA
## 8 Morning Glow NA
## 9 Johnny Raven NA
## 10 Music and Me NA
##
## $`Forever, Michael`
## # A tibble: 10 x 2
## song peak
## <chr> <int>
## 1 We're Almost There 54
## 2 Take Me Back NA
## 3 One Day in Your Life NA
## 4 Cinderella Stay Awhile NA
## 5 We've Got Forever NA
## 6 Just a Little Bit of You 23
## 7 You Are There NA
## 8 Dapper-Dan NA
## 9 Dear Michael NA
## 10 I'll Come Home to You NA
##
## $`Off the Wall`
## # A tibble: 10 x 2
## song peak
## <chr> <int>
## 1 Don't Stop 'Til You Get Enough 1
## 2 Rock with You 1
## 3 Working Day and Night NA
## 4 Get on the Floor NA
## 5 Off the Wall 10
## 6 Girlfriend NA
## 7 She's Out of My Life 10
## 8 I Can't Help It NA
## 9 It's the Falling in Love NA
## 10 Burn This Disco Out NA
##
## $Thriller
## # A tibble: 9 x 2
## song peak
## <chr> <int>
## 1 Wanna Be Startin' Somethin 5
## 2 Baby be Mine NA
## 3 The Girl is Mine 2
## 4 Thriller 4
## 5 Beat It 1
## 6 Billie Jean 1
## 7 Human Nature 7
## 8 P.Y.T. (Pretty Young Thing) 10
## 9 The Lady in My Life NA
##
## $Bad
## # A tibble: 10 x 2
## song peak
## <chr> <int>
## 1 Bad 1
## 2 The Way You Make Me Feel 1
## 3 Speed Demon NA
## 4 Liberian Girl NA
## 5 Just Good Friends NA
## 6 Another Part of Me 11
## 7 Man in the Mirror 1
## 8 I Just Can't Stop Loving You 1
## 9 Dirty Diana 1
## 10 Smooth Criminal 7
##
## $Dangerous
## # A tibble: 14 x 2
## song peak
## <chr> <int>
## 1 Jam 26
## 2 Why You Wanna Trip on Me NA
## 3 In the Closet 6
## 4 She Drives Me Wild NA
## 5 Remember the Time 3
## 6 Can't Let Her Get Away NA
## 7 Heal the World 27
## 8 Black or White 1
## 9 Who Is It 14
## 10 Give In to Me NA
## 11 Will You Be There 7
## 12 Keep the Faith NA
## 13 Gone Too Soon NA
## 14 Dangerous NA
##
## $HIStory
## # A tibble: 30 x 2
## song peak
## <chr> <int>
## 1 Billie Jean NA
## 2 The Way You Make Me Feel NA
## 3 Black or White NA
## 4 Rock with You NA
## 5 She's Out of My Life NA
## 6 Bad NA
## 7 I Just Can't Stop Loving You NA
## 8 Man in the Mirror NA
## 9 Thriller NA
## 10 Beat It NA
## # ... with 20 more rows
##
## $Invincible
## # A tibble: 16 x 2
## song peak
## <chr> <int>
## 1 Unbreakable NA
## 2 Heartbreaker NA
## 3 Invincible NA
## 4 Break of Dawn NA
## 5 Heaven Can Wait NA
## 6 You Rock My World 10
## 7 Butterflies NA
## 8 Speechless NA
## 9 2000 Watts NA
## 10 You Are My Life NA
## 11 Privacy NA
## 12 Don't Walk Away NA
## 13 Cry NA
## 14 The Lost Children NA
## 15 Whatever Happens NA
## 16 Threatened NA
# Replace the first line so each album has its own rows
bind_rows(michael, .id = c("album")) %>%
group_by(album) %>%
mutate(rank = min_rank(peak)) %>%
filter(rank == 1) %>%
select(-rank, -peak) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| album | song |
|---|---|
| Got to Be There | Rockin’ Robin |
| Ben | Ben |
| Music & Me | With a Child’s Heart |
| Forever, Michael | Just a Little Bit of You |
| Off the Wall | Don’t Stop ’Til You Get Enough |
| Off the Wall | Rock with You |
| Thriller | Beat It |
| Thriller | Billie Jean |
| Bad | Bad |
| Bad | The Way You Make Me Feel |
| Bad | Man in the Mirror |
| Bad | I Just Can’t Stop Loving You |
| Bad | Dirty Diana |
| Dangerous | Black or White |
| HIStory | You Are Not Alone |
| Invincible | You Rock My World |
Working with data types
Be aware of data types when combining vectors or data frames
- R will need to make a decision about the resulting type if the original types differ for a vector/column
<div style="width:300px; height=200px">

</div>
- 6 atomic data types in R
- logical
- character
- double
- integer
- complex
- raw
typeof(TRUE)## [1] "logical"
typeof("hello")## [1] "character"
typeof(3.14)## [1] "double"
typeof(1L)## [1] "integer"
typeof(1 + 2i)## [1] "complex"
typeof(raw(1))## [1] "raw"
Atomic data types
- R will coerce logicals and numerics to strings if strings are present
- R will coerce logicals to numerics if numerics are present
- as.character() will coerce factors to their labels as strings if strings are present
- as.numeric() will coerce factors to their levels as numerics if numerics are present
dplyr’s coercion rules
Base R coercion rules:
- in general the more specific type of data will convert to the more general
- integer and logical will go to integer
- double and integer/locgical will go to double
- character and integer/double/logical will go to character
- factors are tricky
- if joining to an integer or double the factor integer values will be used
- if joining to a character the factor labels will be used
- be careful if you have numeric labels for your factors
dplyr’s coercion rules:
- dplyr functions will not automatically coerce data types
- returns and error
- expects you to manyally coerce data
- Exception: factors
- dplyr coverts non-aligning factors to strings
- gives warning message
dplyr and coercion
- dplyr’s two table functions will
- coerce two factors to a string if the factors have different levels
- coerce two factors to a string if the factors have the same levels in a different order
- return an error instead of coercing logicals, integers, and numerics to a character
Determining type
- eighties has the year variable stores as a factor
- sevnties has it stored as a character
- when combined it will be a character
- we will get a warning message
seventies %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | album | band |
|---|---|---|
| 1970 | Bridge Over Troubled Water | Simon and Garfunkel |
| 1971 | Jesus Christ Superstar | Various Artists |
| 1972 | Harvest | Neil Young |
| 1973 | The World is a Ghetto | War |
| 1974 | Goodbye Yellow Brick Road | Elton John |
| 1975 | Elton John’s Greatest Hits | Elton John |
| 1976 | Peter Frampton | Frampton Comes Alive |
| 1977 | Rumours | Fleetwood Mac |
| 1978 | Saturday Night Fever | Bee Gees |
| 1979 | Billy Joel | 52nd Street |
eighties %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | album | band |
|---|---|---|
| 1980 | The Wall | Pink Floyd |
| 1981 | Hi Infidelity | REO Speedwagon |
| 1982 | Asia | Asia |
| 1983 | Thriller | Michael Jackson |
| 1984 | Thriller | Michael Jackson |
| 1985 | Born in the U.S.A. | Bruce Springsteen |
| 1986 | Whitney Houston | Whitney Houston |
| 1987 | Slippery When Wet | Bon Jovi |
| 1988 | Faith | George Michael |
| 1989 | Don’t Be Cruel | Bobby Brown |
both <- seventies %>% bind_rows(eighties)## Warning in bind_rows_(x, .id): binding factor and character vector,
## coercing into character vector
## Warning in bind_rows_(x, .id): binding character and factor vector,
## coercing into character vector
both %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | album | band |
|---|---|---|
| 1970 | Bridge Over Troubled Water | Simon and Garfunkel |
| 1971 | Jesus Christ Superstar | Various Artists |
| 1972 | Harvest | Neil Young |
| 1973 | The World is a Ghetto | War |
| 1974 | Goodbye Yellow Brick Road | Elton John |
| 1975 | Elton John’s Greatest Hits | Elton John |
| 1976 | Peter Frampton | Frampton Comes Alive |
| 1977 | Rumours | Fleetwood Mac |
| 1978 | Saturday Night Fever | Bee Gees |
| 1979 | Billy Joel | 52nd Street |
| 1980 | The Wall | Pink Floyd |
| 1981 | Hi Infidelity | REO Speedwagon |
| 1982 | Asia | Asia |
| 1983 | Thriller | Michael Jackson |
| 1984 | Thriller | Michael Jackson |
| 1985 | Born in the U.S.A. | Bruce Springsteen |
| 1986 | Whitney Houston | Whitney Houston |
| 1987 | Slippery When Wet | Bon Jovi |
| 1988 | Faith | George Michael |
| 1989 | Don’t Be Cruel | Bobby Brown |
Results
sixties contains the top selling albums in the US in the 1960s. It stores year as a numeric (double). When you combine it with seventies, which stores year as a factor, bind_rows() returns an error.
You can fix this by coercing seventies$year to a numeric. But if you do it like this, something surprising happens.
seventies %>%
mutate(year = as.numeric(year)) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | album | band |
|---|---|---|
| 1 | Bridge Over Troubled Water | Simon and Garfunkel |
| 2 | Jesus Christ Superstar | Various Artists |
| 3 | Harvest | Neil Young |
| 4 | The World is a Ghetto | War |
| 5 | Goodbye Yellow Brick Road | Elton John |
| 6 | Elton John’s Greatest Hits | Elton John |
| 7 | Peter Frampton | Frampton Comes Alive |
| 8 | Rumours | Fleetwood Mac |
| 9 | Saturday Night Fever | Bee Gees |
| 10 | Billy Joel | 52nd Street |
Can you fix things?
# sixties$year is a numeric and seventies$year is a factor
str(sixties)## Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of 3 variables:
## $ year : num 1960 1961 1962 1963 1964 ...
## $ album: chr "The Sound of Music" "Camelot" "West Side Story" "West Side Story" ...
## $ band : chr "Original Broadway Cast" "Original Broadway Cast" "Soundtrack" "Soundtrack" ...
str(seventies)## Classes 'tbl_df', 'tbl' and 'data.frame': 10 obs. of 3 variables:
## $ year : Factor w/ 10 levels "1970","1971",..: 1 2 3 4 5 6 7 8 9 10
## $ album: chr "Bridge Over Troubled Water" "Jesus Christ Superstar" "Harvest" "The World is a Ghetto" ...
## $ band : chr "Simon and Garfunkel" "Various Artists" "Neil Young" "War" ...
# This gives an error about not converting factor to numeric (i.e Column `year` can't be converted from factor to numeric)
#seventies %>% bind_rows(sixties)
# So lets convert it....
# But this converts the year to the integer factor value, not the year. Whoops
seventies %>%
mutate(year = as.numeric(year))## # A tibble: 10 x 3
## year album band
## <dbl> <chr> <chr>
## 1 1 Bridge Over Troubled Water Simon and Garfunkel
## 2 2 Jesus Christ Superstar Various Artists
## 3 3 Harvest Neil Young
## 4 4 The World is a Ghetto War
## 5 5 Goodbye Yellow Brick Road Elton John
## 6 6 Elton John's Greatest Hits Elton John
## 7 7 Peter Frampton Frampton Comes Alive
## 8 8 Rumours Fleetwood Mac
## 9 9 Saturday Night Fever Bee Gees
## 10 10 Billy Joel 52nd Street
# We need to make it a character first to get the year, then convert to numeric
seventies %>% mutate(year = as.numeric(as.character(year)))## # A tibble: 10 x 3
## year album band
## <dbl> <chr> <chr>
## 1 1970 Bridge Over Troubled Water Simon and Garfunkel
## 2 1971 Jesus Christ Superstar Various Artists
## 3 1972 Harvest Neil Young
## 4 1973 The World is a Ghetto War
## 5 1974 Goodbye Yellow Brick Road Elton John
## 6 1975 Elton John's Greatest Hits Elton John
## 7 1976 Peter Frampton Frampton Comes Alive
## 8 1977 Rumours Fleetwood Mac
## 9 1978 Saturday Night Fever Bee Gees
## 10 1979 Billy Joel 52nd Street
# Now we can join the datasets correctly
seventies %>%
# Coerce seventies$year into a useful numeric
mutate(year = as.numeric(as.character(year))) %>%
# Bind the updated version of seventies to sixties
bind_rows(sixties) %>%
arrange(year) %>%
# Create pretty table
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "left", font_size = 11) %>%
row_spec(0, bold = T, color = "white", background = "#3f7689")| year | album | band |
|---|---|---|
| 1960 | The Sound of Music | Original Broadway Cast |
| 1961 | Camelot | Original Broadway Cast |
| 1962 | West Side Story | Soundtrack |
| 1963 | West Side Story | Soundtrack |
| 1964 | Hello, Dolly! | Original Broadway Cast |
| 1965 | Mary Poppins | Soundtrack |
| 1966 | Whipped Cream & Other Delights | Herb Alpert & The Tijuana Brass |
| 1967 | More of The Monkees | The Monkees |
| 1968 | Are You Experienced? | The Jimi Hendrix Experience |
| 1969 | In-A-Gadda-Da-Vida | Iron Butterfly |
| 1970 | Bridge Over Troubled Water | Simon and Garfunkel |
| 1971 | Jesus Christ Superstar | Various Artists |
| 1972 | Harvest | Neil Young |
| 1973 | The World is a Ghetto | War |
| 1974 | Goodbye Yellow Brick Road | Elton John |
| 1975 | Elton John’s Greatest Hits | Elton John |
| 1976 | Peter Frampton | Frampton Comes Alive |
| 1977 | Rumours | Fleetwood Mac |
| 1978 | Saturday Night Fever | Bee Gees |
| 1979 | Billy Joel | 52nd Street |
Session info
sessionInfo()## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=German_Switzerland.1252 LC_CTYPE=German_Switzerland.1252
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C
## [5] LC_TIME=German_Switzerland.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.1.0 dplyr_0.8.0.1 gapminder_0.3.0 kableExtra_1.0.1
## [5] knitr_1.21
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.0 highr_0.7 plyr_1.8.4
## [4] pillar_1.3.1 compiler_3.5.2 prettydoc_0.2.1
## [7] tools_3.5.2 digest_0.6.18 gtable_0.2.0
## [10] evaluate_0.12 tibble_2.0.1 viridisLite_0.3.0
## [13] pkgconfig_2.0.2 rlang_0.3.1 cli_1.0.1
## [16] rstudioapi_0.9.0 yaml_2.2.0 xfun_0.4
## [19] withr_2.1.2 httr_1.4.0 stringr_1.4.0
## [22] xml2_1.2.0 hms_0.4.2 webshot_0.5.1
## [25] grid_3.5.2 tidyselect_0.2.5 glue_1.3.0
## [28] R6_2.4.0 fansi_0.4.0 rmarkdown_1.11
## [31] readr_1.3.1 purrr_0.3.0 magrittr_1.5
## [34] codetools_0.2-15 scales_1.0.0 htmltools_0.3.6
## [37] assertthat_0.2.0 rvest_0.3.2 colorspace_1.4-0
## [40] utf8_1.1.4 stringi_1.3.1 lazyeval_0.2.1
## [43] munsell_0.5.0 crayon_1.3.4